Abstract

The order of letters is not always relevant in a communication task. This paper discusses the
implications of order irrelevance on source coding, presenting results in several major branches of
source coding theory: lossless coding, universal lossless coding, rate-distortion, high-rate quantization,
and universal lossy coding. The main conclusions demonstrate that there is a significant rate savings when
order is irrelevant. In particular, lossless coding of n letters from a finite alphabet requires O(log n) bits
and universal lossless coding requires n + o(n) bits for many countable alphabet sources. However,
there are no universal schemes that can drive a strong redundancy measure to zero. Results for lossy
coding include distribution-free expressions for the rate savings from order irrelevance in various high-rate
quantization schemes. Rate-distortion bounds are given, and it is shown that the analogue of the
Shannon lower bound is loose at all finite rates.
"ceiiinosssttuv": Robert Hooke in 1676, establishing priority for the statement
"ut tensio sic vis" (as is the extension, so is the force), published in 1678 [1].

I. Introduction 

Are there situations where claude shannon is no different than a sound channel; where
maximum entropy might be considered a reasonable reconstruction for momentary mixup?^1 If one
is interested in communicating a textual source, an anagram is not a sufficient representation. If, however,
one is simply interested in classifying the language, an anagram may be sufficient because it gives the same
first-order language approximation.^2 That is to say, the values of the letters (a multiset) may be important
when the order of the letters (a permutation) is not. The order of source letters is irrelevant in a multitude 
of scenarios beyond such language representation and related texture representation applications [4]. 
Examples include warehouse inventories [5]; records in scientific or financial databases [6]; collections 
of multimedia files; arrival processes [7]; visual languages that have multiset grammars [8]; data in 
a parallel computing paradigm [9]; and the channel state in chemical channels [10]. Moreover, it has 
been suggested that when humans use data for recognition or recall tasks [11], or for judgments of 
coincidences [12], [13], the order of symbols is not relevant. 

If a sample consists of independent observations from the same distribution, then associated minimum
variance unbiased estimators are symmetric in the observations [14]. Therefore when coding for
estimation, the multiset of observations is all that need be represented, cf. [15]. Moving beyond the
point-to-point case, in distributed inference, often particle-based [16], [17] and kernel-based [18], [19]
representations of densities must be communicated. Because such a representation is unchanged by
any permutation $\pi(\cdot)$ of its coefficients, the multiset of representation coefficients $\{x_i\}$ may be
communicated rather than the sequence of these values $(x_i)$. This extends to any destination that
performs permutation-invariant computations.

The aim of this paper is to develop ramifications of order irrelevance on source coding problems. We 
consider lossless coding, universal lossless coding, high-rate and low-rate quantization, and rate distortion 

1 Anagrams due to R. J. McEliece, 2004 Shannon Lecture.

2 In Shannon's sense of language approximation [2], first-order approximation requires that the distribution of letters matches 
the source, second-order approximation requires that the distribution of digrams matches the source, and so on; see also classical 
criticisms to this method of approximation [3]. 
theory. In all of these, the rate requirement is obviously reduced by making order irrelevant. The reduction
can be dramatic: for lossless coding of n symbols, the required rate is changed from Θ(n) to O(log n);
for lossy coding, with large enough n, arbitrarily small mean squared error (MSE) can be achieved with
zero rate. These examples are made precise in the sequel.

A. Notation and Formalism 

We consider the encoding of multisets and sequences of letters of size n drawn from an alphabet $\mathcal{X}$.
When $\mathcal{X}$ is discrete, we take it to be the (possibly infinite) set $\{1, 2, \ldots, |\mathcal{X}|\}$ without loss of generality.
In addition to standard uses of parentheses, we will also use parentheses and braces to distinguish
between sequences and multisets. For the ordered sequence $X_1, X_2, \ldots, X_n$ that is often denoted $X_1^n$
in the information theory literature, we write $(X_i)_{i=1}^n$. When these n symbols are taken as an unordered
multiset, we write $\{X_i\}_{i=1}^n$. The range limits are often omitted. We will refer to the distribution that
describes $(X_i)_{i=1}^n$ as the parent distribution.

There are two perspectives that can be taken to relate (order-irrelevant) source coding of $\{X_i\}_{i=1}^n$
with (standard) source coding of $(X_i)_{i=1}^n$. In either case we take the sample space to be the set
of all sequences $\mathcal{X}^n$. Under the first perspective, which we take when discussing lossless coding, we
define the event algebra, $\mathcal{F}$, based on permutation-invariant equivalence classes of sequences, rather than
the sequences themselves. Since events are defined in terms of multisets, the source coding problem is
formally no different than the standard one, though the results are interestingly different.

For lossy coding, we take an alternative perspective where the event algebra is based on sequences. 
Order irrelevance is incorporated by considering fidelity criteria with a permutation-invariance property. 
These fidelity criteria cannot be stated in single-letter terms, thus calling the mathematical tractability 
into question. However, like the non-single-letter fidelity criteria in [20], [21], our fidelity criteria are 
tractable and also bear semantic significance on several applications. 

All logarithms use base 2, and all rates are thus given in bits. In this paper, rates are generally not 
normalized by the number of symbols n. The reason for this departure from convention will soon become 
clear: the total number of bits required in some problems scales sublinearly with n. We use standard 
asymptotic notation such as o(·), O(·), Ω(·), and Θ(·) [22].

B. Outline 

The remainder of the paper is organized as follows. In Section II, we propose the transformation of a
sequence into an order and a multiset of values, and we show that the order and values are independent
when letters are produced i.i.d. Sections III and IV address lossless coding. First we consider lossless
coding with known distribution for both finite and countably infinite alphabets in Section III. Section IV
considers the universal setting and provides both a positive result (achievability of a rather low coding
rate of 1 bit per letter) and a negative result (unachievability of negligible redundancy).

In Sections V and VI, we turn to lossy coding. In Section V-A, it is shown that, for a large class
of sources, a natural rate-distortion function is the trivial zero-zero point. This inspires restriction to
finite-sized blocks in Sections V-B and V-C. These sections discuss the rate-distortion functions for
the discrete- and uncountable-alphabet cases, respectively. Section V-C also presents several high-rate
quantization analyses. Universal lossy coding is considered in Section VI. Finally, Section VII concludes
the paper with a discussion of intermediates between full relevance and full irrelevance of order; additional
connections to related work; and a summary of the main results.

Many results presented here appeared first in [23]-[26]. 

II. Separating Order and Value 

Consider source variables $X_1, X_2, \ldots, X_n$ drawn from a common alphabet $\mathcal{X}$ according to any joint
distribution. A realization $(x_i)_{i=1}^n$ can be decomposed into a multiset of values $\{x_i\}_{i=1}^n$ and an order $j$.
This can be expressed as

$$(x_1, x_2, \ldots, x_n) = (y_{i_1}, y_{i_2}, \ldots, y_{i_n}) \longrightarrow \begin{pmatrix} i_1 & i_2 & \cdots & i_n \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} = \begin{pmatrix} j \\ \{x_i\}_{i=1}^n \end{pmatrix}, \tag{1}$$

where $(y_i)_{i=1}^n$ is $(x_i)_{i=1}^n$ put into a canonical order.^3 The indices $i_1, i_2, \ldots, i_n$ are a permutation of the
integers 1, 2, ..., n and a deterministic function of $(x_i)_{i=1}^n$; when the $x_i$s are not distinct, we require
any deterministic mechanism for choosing amongst the permutations such that (1) holds. The ordering
is collapsed into a single variable $j$, which defines a chance variable $J$.
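
The decomposition in (1) can be sketched in a few lines of Python. This is only an illustration: taking ascending sort as the canonical order and breaking ties by original position are assumptions, and the function names `decompose` and `recompose` are hypothetical.

```python
from typing import List, Sequence, Tuple


def decompose(xs: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split a sequence into (values in canonical order, order indices).

    Assumptions: the canonical order is ascending sort, and ties are broken
    by original position (Python's sort is stable).
    """
    # Positions of xs listed in order of increasing value.
    by_value = sorted(range(len(xs)), key=lambda k: xs[k])
    ys = [xs[k] for k in by_value]      # the multiset, in canonical order
    order = [0] * len(xs)
    for rank, k in enumerate(by_value):
        order[k] = rank                 # x_k equals y_{order[k]}
    return ys, order


def recompose(ys: Sequence[int], order: Sequence[int]) -> List[int]:
    """Invert decompose: rebuild the sequence from values and order."""
    return [ys[i] for i in order]


values, order = decompose([3, 1, 2, 1])
print(values, order)             # the multiset and the order
print(recompose(values, order))  # the original sequence
```

When the order is irrelevant, only `values` needs to be represented; `order` is the part that is allocated no bits.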

The decomposition into order and value can be interpreted as the generation of "transform coefficients."
Whether decomposing signals into low frequency and high frequency [27]; predictable and
unpredictable [28]; style and content [29]; object and texture [30]; or dictionary and pattern [31],
divide-and-conquer approaches have been used to good advantage in many source coding scenarios. Here, we
are concerned with situations in which the order J is irrelevant and hence allocated no bits. In contrast, 

3 We could say that y is the sorted version of x, but we want to emphasize that there is no need at this point for a meaningful 
order for X. 
allocating all the bits to J yields permutation source codes [32]. Other rate allocations are discussed
in [23]. 

If the joint distribution of $(X_i)_{i=1}^n$ is exchangeable, $J$ and $\{X_i\}_{i=1}^n$ are statistically independent chance
variables [33]. Thus they could be coded separately without loss of optimality. This is expressed in
information theoretic terms by the following theorem.

Theorem 1. For exchangeable sources, the order and the value are independent, and the sequence entropy
$H((X_i)_{i=1}^n)$ can be decomposed into the value entropy $H(\{X_i\}_{i=1}^n)$ and the order entropy $H(J)$:

$$H((X_i)_{i=1}^n) = H(\{X_i\}_{i=1}^n) + H(J). \tag{2}$$

Proof: Suppressing unnecessary subscripts, we can write

$$\begin{aligned}
H((X)) &\overset{(a)}{=} H((X)) + H(\{X\}) - H((X) \mid J) = H(\{X\}) + I((X); J) \\
&= H(\{X\}) + H(J) - H(J \mid (X)) \overset{(b)}{=} H(\{X\}) + H(J).
\end{aligned}$$

Step (a) follows from noting that $H(\{X\}) = H((X) \mid J)$ for exchangeable sources, since all orderings
are equiprobable and uninformative about the value. Step (b) follows from the fact $H(J \mid (X)) = 0$, since
the sequence determines the order. The other steps are simple informational manipulations. ■
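
As a small numerical illustration of Theorem 1, the following Python sketch enumerates an exchangeable source with no ties. The source itself is an assumption chosen for convenience (three letters drawn uniformly without replacement from {1, ..., 5}); it is not an example from the paper.

```python
from collections import Counter
from itertools import permutations
from math import isclose, log2


def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(q * log2(q) for q in probs if q > 0)


def order_of(seq):
    """Permutation that sorts seq; well defined here because entries are distinct."""
    return tuple(sorted(range(len(seq)), key=lambda k: seq[k]))


n = 3
# Assumed source: uniform draws without replacement, so all 5*4*3 ordered
# sequences are equiprobable and the joint distribution is exchangeable.
sequences = list(permutations(range(1, 6), n))
p = 1.0 / len(sequences)

value_prob, order_prob = Counter(), Counter()
for seq in sequences:
    value_prob[tuple(sorted(seq))] += p   # the multiset, in canonical order
    order_prob[order_of(seq)] += p        # the order J

h_sequence = entropy([p] * len(sequences))
h_value = entropy(value_prob.values())
h_order = entropy(order_prob.values())
print(h_sequence, h_value + h_order)
print(isclose(h_sequence, h_value + h_order, rel_tol=1e-9))
```

Here there are 10 equiprobable multisets and 6 equiprobable orders, so the two entropies add up to the sequence entropy, as (2) states.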

When we disregard order, we are just left with a multiset. Type classes (also known variously as
histograms or empirical distributions in statistics and as rearrangement classes or Abelian classes in
combinatorics) are complete, minimal sufficient statistics for multisets. For discrete-alphabet sources,
types are convenient mathematical representations for multisets; several results for these sources will
depend on counting numbers of type classes.

Despite being sufficient statistics, types are not useful in the representation of continuous-alphabet
sources. As Csiszár [34] writes: "extensions of the type concept to continuous alphabets are not known."
In the case that $\mathcal{X} = \mathbb{R}$, we make extensive use of the basic distribution theory of order statistics. When
the sequence of random variables $X_1, \ldots, X_n$ is arranged in ascending order as $X_{(1:n)} \le \cdots \le X_{(n:n)}$,
$X_{(r:n)}$ is called the rth order statistic. It can be shown that the order statistics for exchangeable variates
are complete, minimal sufficient statistics [35]. For alphabets of vectors of real numbers, there is no
simple canonical form for expressing the minimal sufficient statistic since there is no natural ordering of
vectors [36].
III. Lossless Coding

Consider the lossless coding of multisets of n letters drawn from the discrete alphabet $\mathcal{X}$. Since there
are n! permutations of a sequence of length n, it would seem that a rate savings of log(n!) relative to
sequence coding might be possible. Specifically, the upper bound $H(J) \le \log(n!)$ combined with (2)
gives the lower bound

$$H(\{X_i\}_{i=1}^n) \ge H((X_i)_{i=1}^n) - \log n!. \tag{3}$$

Since this lower bound can be negative, there must be more to the story.

The lower bound is not tight due to the positive chance of ties among members of a multiset drawn 
from a discrete parent. If the chance of ties is small (if $|\mathcal{X}|$ is sufficiently large and n is sufficiently
small), the lower bound is a good approximation.
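
To see how loose the bound (3) can be when ties are likely, the multiset entropy can be computed exactly for small cases by brute force. The sketch below is illustrative only; the two parent distributions and the helper name `compare` are assumptions.

```python
from collections import Counter
from itertools import product
from math import factorial, log2


def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(q * log2(q) for q in probs if q > 0)


def multiset_entropy(pmf, n):
    """Exact H({X_i}) for n i.i.d. draws, by enumerating all sequences."""
    type_prob = Counter()
    for seq in product(range(len(pmf)), repeat=n):
        prob = 1.0
        for letter in seq:
            prob *= pmf[letter]
        type_prob[tuple(sorted(seq))] += prob   # sequences with the same multiset
    return entropy(type_prob.values())


def compare(pmf, n):
    """Return (lower bound from (3), multiset entropy, sequence entropy)."""
    h_seq = n * entropy(pmf)                    # i.i.d. sequence entropy
    return h_seq - log2(factorial(n)), multiset_entropy(pmf, n), h_seq


print(compare([0.5, 0.3, 0.2], 4))   # bound is positive but well below H({X_i})
print(compare([0.5, 0.5], 8))        # bound is negative, hence uninformative
```

In both cases the multiset entropy lies between the bound and the sequence entropy; for the fair binary source with n = 8 the bound is already negative.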

For any given source distribution and n, a multiset of samples $\{x_i\}_{i=1}^n$ can be cast as a superletter
drawn from an alphabet of multisets, itself having a known distribution. By the lossless block-to-variable
source coding theorem [2], the entropy of the superletter is an asymptotically tight lower bound on the
rate required for representation. Since the type specifies the multiset, as mentioned in Section II, we can
write

$$H(\{X_i\}_{i=1}^n) = H(K_1, K_2, \ldots, K_{|\mathcal{X}|}), \tag{4}$$

where $K_i$ is the number of occurrences of $x_i$ in n trials. While we will exhibit a few explicit calculations,
our main interest is in relating the rate requirement to the sample size n. For this we first consider finite
alphabets and then infinite alphabets.



A. Finite Alphabets 

If the multiset is drawn i.i.d., the distribution of types is given by a multinomial distribution derived
from the parent distribution [37, Problem VI]. Suppose $x_i \in \mathcal{X}$ has probability $p_i$ in the parent. Then

$$\Pr[K_i = k_i,\ i = 1, \ldots, |\mathcal{X}|] = \binom{n}{k_1, k_2, \ldots, k_{|\mathcal{X}|}} \prod_{i=1}^{|\mathcal{X}|} p_i^{k_i}$$

for any type $(k_1, k_2, \ldots, k_{|\mathcal{X}|})$ of non-negative integers with sum n.

The simplest case of a binary source ($|\mathcal{X}| = 2$) gives $K_1 \sim \mathrm{binomial}(n, p)$ and $K_2 = n - K_1$,
where $p = \Pr[X_i = 1]$. Then since $K_2$ is a deterministic function of $K_1$, we have the simplification
$H(K_1, K_2) = H(K_1)$. Now we have

$$H(\{X_i\}_{i=1}^n) = H(K_1) = \tfrac{1}{2} \log(2\pi e\, p(1-p)\, n) + \sum_{k=1}^{\infty} a_k n^{-k} \tag{5}$$
for some constants $a_1, a_2, \ldots$. The leading term can be obtained with the de Moivre approximation of a
binomial random variable with a Gaussian random variable [37, pp. 243-259]; the full expansion requires
more sophisticated techniques [38].
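
The leading term of (5) can be compared with the exact entropy of a binomial random variable. The following Python sketch is a numerical check under assumed parameter values (p = 0.3 and a few moderate n); it does not compute the constants $a_k$.

```python
from math import comb, e, log2, pi


def binomial_entropy(n, p):
    """Exact entropy in bits of a binomial(n, p) random variable.

    Uses floating point directly, so n should stay moderate (a few hundred)
    to avoid overflow when the binomial coefficients are converted to floats.
    """
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    return -sum(q * log2(q) for q in probs if q > 0)


def leading_term(n, p):
    """Gaussian-approximation term (1/2) log2(2*pi*e*p*(1-p)*n)."""
    return 0.5 * log2(2 * pi * e * p * (1 - p) * n)


for n in (50, 100, 200, 400):
    exact, approx = binomial_entropy(n, 0.3), leading_term(n, 0.3)
    print(n, exact, approx, exact - approx)
```

The printed gap between the exact value and the leading term shrinks as n grows, consistent with the remaining terms in (5) being a series in negative powers of n.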

To emphasize the dependence on n, note that the rate in (5) is $\tfrac{1}{2} \log n + c + o(1)$, where c is some
constant. We will now see that an O(log n) lossless coding rate extends to all finite-alphabet multiset sources.

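Theorem 2. Let $|\mathcal{X}|$ be finite. Then $H(\{X_i\}_{i=1}^n) = O(\log n)$.

Proof: Denote the alphabet of distinct types by $\mathcal{K}(\mathcal{X}, n)$. By simple combinatorics [34],

$$|\mathcal{K}(\mathcal{X}, n)| = \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1} \le (n+1)^{|\mathcal{X}|}. \tag{6}$$

Recalling the equality (4), the desired entropy $H(\{X_i\}_{i=1}^n)$ is upper-bounded by the logarithm of $|\mathcal{K}(\mathcal{X}, n)|$.
Thus,

$$H(\{X_i\}_{i=1}^n) \le |\mathcal{X}| \log(n+1) = O(\log n)$$

since $|\mathcal{X}|$ is finite. ■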
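
The counting step in (6) is easy to confirm by enumeration for small cases. A minimal Python sketch, with the helper name `num_types` and the test sizes chosen as assumptions:

```python
from itertools import combinations_with_replacement
from math import comb, log2


def num_types(alphabet_size, n):
    """Count multisets of size n over an alphabet, by direct enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(alphabet_size), n))


for alphabet_size, n in [(2, 10), (3, 5), (4, 6)]:
    count = num_types(alphabet_size, n)
    closed_form = comb(n + alphabet_size - 1, alphabet_size - 1)
    bound = (n + 1) ** alphabet_size
    # log2(count) upper-bounds the multiset entropy for any source.
    print(alphabet_size, n, count, closed_form, bound, log2(count))
```

For each pair the enumerated count matches the binomial coefficient and stays below $(n+1)^{|\mathcal{X}|}$.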
Note that the theorem holds for any source, not just for i.i.d. sources. For a non-trivial i.i.d. source we
can use the calculation (5) to show an Ω(log n) lower bound, so in fact we have $H(\{X_i\}_{i=1}^n) = \Theta(\log n)$.

For an i.i.d. source, the upper bound in the proof is quite loose. To achieve the bound with equality, each
of the types would have to be equiprobable; however by the strong asymptotic equipartition property [39],
collectively, all non-strongly typical types will occur with arbitrarily small probability. The number of
types in the strongly typical set is polynomial in n, so an upper bound obtained by counting only these
types would still be Θ(log n).

B. Countable Alphabets 

Theorem 2, with its presented proof, obviously does not extend to infinite alphabets. To get an interesting
bound we must do more than enumerate types.
Define the entropy rate of a multiset as

$$\mathcal{H}(X) = \lim_{n \to \infty} \frac{1}{n} H(\{X_i\}_{i=1}^n). \tag{7}$$

Theorem 2 shows that finite-alphabet sources yield multisets with zero entropy rate. Using a
dictionary-pattern decomposition, we will show a related result for countable-alphabet sources.

A sequence may be decomposed into a dictionary, $\Delta$, and a pattern, $(\Psi_i)$, where the dictionary specifies
which letters from the alphabet have appeared and the pattern specifies the order in which these letters
have appeared [31]. For a sequence $(x_i)$, dictionary entry $\delta_k \in \mathcal{X}$ is the kth distinct letter in the sequence
and pattern entry $\Psi_i \in \mathbb{Z}^+$ is the dictionary index of the ith letter in the sequence. Note that the type of
a pattern, denoted as $\{\Psi_i\}$, and the pattern of a multiset, denoted as $\Psi(\{X_i\})$, are the same. It can be
seen that a multiset is determined by $\Delta$ and $\{\Psi_i\}$; the order of the pattern, $J((\Psi_i))$, is not needed.
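
The dictionary-pattern decomposition can be sketched as follows. The function name `dictionary_and_pattern` and the example word are assumptions made for illustration; indices are 1-based to match the convention that pattern entries are positive integers.

```python
from collections import Counter


def dictionary_and_pattern(xs):
    """Return (dictionary, pattern) of a sequence.

    dictionary[k-1] is the k-th distinct letter to appear; pattern[i] is the
    1-based dictionary index of the i-th letter of the sequence.
    """
    dictionary, index_of, pattern = [], {}, []
    for x in xs:
        if x not in index_of:
            dictionary.append(x)
            index_of[x] = len(dictionary)
        pattern.append(index_of[x])
    return dictionary, pattern


word = "abracadabra"
dictionary, pattern = dictionary_and_pattern(word)
print(dictionary)   # letters in order of first appearance
print(pattern)      # dictionary indices, in sequence order

# The type of the pattern (its multiset of entries) together with the
# dictionary recovers the multiset of letters, without the pattern's order.
pattern_type = Counter(pattern)
multiset = Counter({dictionary[k - 1]: c for k, c in pattern_type.items()})
print(multiset == Counter(word))
```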

Based on [22], we show that the entropy rate of a multiset generated by a discrete finite-entropy
stationary process and the entropy rate of its pattern are equal. First note that the dictionary of a sequence
and the dictionary of its associated multiset are the same; we use $\Delta$ to signify either one. Since $\{X_i\}$
determines $\Psi(\{X_i\})$ and since given $\{\Psi_i\}$, there is a one-to-one correspondence between $\{X_i\}$ and $\Delta$,

$$H(\{X_i\}) = H(\{\Psi_i\}) + H(\{X_i\} \mid \{\Psi_i\}) = H(\{\Psi_i\}) + H(\Delta \mid \{\Psi_i\}).$$

If we can show that $H(\Delta \mid \{\Psi_i\})$ is o(n), then it will follow that the entropy rate of the multiset, $\mathcal{H}(X)$,
is equal to the entropy rate of the pattern of the multiset

$$\mathcal{H}(\Psi) = \lim_{n \to \infty} \frac{1}{n} H(\{\Psi_i\}_{i=1}^n).$$

Noting the fact that the dictionary $\Delta$ is independent of the order of the pattern, $J((\Psi_i))$, we establish
that

$$\lim_{n \to \infty} \frac{1}{n} H(\Delta \mid (\Psi_i)_{i=1}^n) = \lim_{n \to \infty} \frac{1}{n} H(\Delta \mid J((\Psi_i)_{i=1}^n), \{\Psi_i\}_{i=1}^n) = \lim_{n \to \infty} \frac{1}{n} H(\Delta \mid \{\Psi_i\}_{i=1}^n),$$

where the first step follows since the order and type of pattern determine the pattern, and the second step
is due to independence. The result [22, Theorem 9] shows that for all finite-entropy, discrete stationary
processes, the asymptotic per-letter values of $H((\Psi_i))$ and $H((X_i))$ are equal. Hence,

$$\lim_{n \to \infty} \frac{1}{n} H(\Delta \mid (\Psi_i)_{i=1}^n) = 0,$$

and so

$$\lim_{n \to \infty} \frac{1}{n} H(\Delta \mid \{\Psi_i\}_{i=1}^n) = 0.$$

This implies that $H(\Delta \mid \{\Psi_i\})$ is o(n) and yields the following theorem.

Theorem 3. The entropy rate of a multiset generated by a discrete finite-entropy stationary process and
the entropy rate of its pattern coincide:

$$\mathcal{H}(X) = \mathcal{H}(\Psi).$$

Computing the entropy rate of the multiset or equivalently of the pattern of the multiset can be 
difficult. See [40] and references therein for a discussion on computing the entropy of patterns; the 
entropy computation for patterns of multisets is closely related. 
IV. Universal Lossless Coding

The previous section considered source coding for multisets when the source distribution was known. 
Most prominently in estimation and inference but also in other applications, the underlying distribution is 
not known. In this section, we discuss universal source coding of multisets, first presenting an achievability 
result for countable alphabets and then showing that redundancy, defined in a stronger sense than usual, 
cannot be driven to zero. 

A. Universal Achievability 

We propose a source coding scheme that achieves some degree of compression for all members of a 
source class at the same time. We will not compare to the entropy bound, holding off detailed discussion 
of redundancy until Section IV-B.

Consider classes of countable-alphabet i.i.d. sources that meet Kieffer's condition for universal
encodability for the sequence representation problem [41], [42]. For these source classes, the redundancy for
encoding $(X_i)_{i=1}^n$ is o(n). We formulate a universal scheme for the multiset representation problem and
demonstrate an achievability result, making use of the dictionary-pattern decomposition.

As we saw in Section III-B, a multiset can be represented as the concatenation of the pattern of the
multiset and the dictionary. Consider the rate requirements of these two parts separately. First, let us
bound the rate that is required to represent the pattern of the multiset (the type of the pattern). We
can make use of the fact that there are $2^{n-1}$ types of patterns. This enumeration follows because the
types are sequences of positive integers that sum to n. These can appear in any order, thus we are
counting ordered partitions. It is well known that there are $2^{n-1}$ ordered partitions, which can be seen
as determining arrangements of n − 1 possible separations of n places. Thus the rate requirement for an
enumerative universal scheme representing the type of the pattern is n − 1 bits.
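
The count of ordered partitions used above can be confirmed by enumeration. A short Python sketch (the generator name `compositions` is an assumption):

```python
def compositions(n):
    """Yield every ordered partition of n into positive integer parts."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


for n in range(1, 8):
    count = sum(1 for _ in compositions(n))
    print(n, count, 2 ** (n - 1))

print(list(compositions(3)))
```

Each ordered partition corresponds to a choice of which of the n − 1 gaps between n places are separations, which is why n − 1 bits suffice to index one.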

Now we determine the rate requirement of the dictionary given the type of pattern. If the underlying
distribution were known, we saw in Theorem 3 that $H(\Delta \mid \{\Psi_i\}_{i=1}^n)$ is o(n). It was shown in [31] that
there is an $O(\sqrt{n})$ upper bound on the pattern redundancy, independent of $|\mathcal{X}|$. Since this is sublinear,
the asymptotic per-letter redundancy in coding a class of sequences $(X_i)_{i=1}^n$ from a countable alphabet
coincides with the asymptotic per-letter redundancy in coding the dictionary given the pattern. Since we
are considering a class that meets Kieffer's condition, we find the redundancy in coding the dictionary
given the pattern is o(n). This also carries over to coding the dictionary given the pattern of the multiset,
due to the independence between dictionary and order of pattern that we had put forth in Section III-B.
Fig. 1. A histogram for interpretation of Theorem 4. Starting at the left and using the rule described in the text, the histogram
is encoded by 01001100000111010001.



Since both $H(\Delta \mid \{\Psi_i\}_{i=1}^n)$ and the redundancy in coding the dictionary given the pattern of the multiset
are o(n), the total rate requirement for coding the dictionary given the pattern of the multiset is o(n).

Adding together the rate requirements for the two parts yields the following achievability theorem. 
The rate requirement is universally reduced from [0, ∞) bits per letter for the sequence problem to 1 bit
per letter for the multiset problem. 

Theorem 4. Given any i.i.d. source class that is universally encodable as a sequence, the multiset
$\{X_i\}_{i=1}^n$ can be encoded with n + o(n) bits.

Proof: A representation consists of the concatenation of the type of pattern and the dictionary given
the type of pattern. The first part requires n − 1 bits. The second part requires o(n) bits. The total rate
is then n + o(n). ■
Coding a multiset is equivalent to coding a type or histogram. An interpretation of Theorem 4 is thus
that histograms, from a certain class, with total weight n can be encoded with n + o(n) bits. Figure 1
shows a histogram. An encoding method that shows the plausibility of n + o(n) total rate is to encode
the histogram one letter at a time, starting from the left end and moving to the right. After each letter,
use 0 to indicate that there is another occurrence of the same letter and use 1 to move on to the next
letter. If every symbol in the alphabet appears at least once, the rate is n; as long as the right-most letter
encountered does not grow too quickly with n, an n + o(n) rate is achieved.
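
One reading of this encoding rule is sketched below in Python: each occurrence of the current letter is written as a 0 and each move to the next letter as a 1. This reading, the function names, and the example histogram are assumptions; the histogram was chosen so that the output reproduces the bit string quoted in the caption of Fig. 1.

```python
def encode_histogram(counts):
    """Encode letter counts (left to right) as a bit string.

    Assumed rule: write one '0' per occurrence of the current letter, then
    a '1' to move on to the next letter.
    """
    return "".join("0" * k + "1" for k in counts)


def decode_histogram(bits):
    """Invert encode_histogram by measuring the runs of zeros."""
    counts, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:
            counts.append(run)
            run = 0
    return counts


counts = [1, 2, 0, 5, 0, 0, 1, 3]   # hypothetical histogram, total weight n = 12
bits = encode_histogram(counts)
print(bits, len(bits))
print(decode_histogram(bits) == counts)
```

Under this reading the length is n plus the index of the right-most letter encoded, so the rate stays at n + o(n) as long as that index grows sublinearly in n.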

B. Unattainability of Negligible Redundancy 

For finite alphabets, we showed that the multiset entropy rate is zero for any source; this is a crude
consequence of Theorem 2. We also saw, in the proof of Theorem 2, that simply enumerating the type
classes requires zero rate per multiset letter asymptotically. Hence such a universal scheme requires zero
rate for any finite-alphabet source. However, we cannot conclude from "zero equals zero" that the excess
rate due to not knowing the source distribution is negligible.

In this section we take a finer look at universal lossless coding of multisets. We will see that for a
finite-alphabet source, the redundancy cannot be made a negligible fraction of the coding rate. Since
the coding rate with full distributional knowledge is Θ(log n), we come to this conclusion using new
information and redundancy measures that have normalization by log n rather than by n. We will find
that zero-redundancy universal coding of multisets is not possible with respect to the class of memoryless
multisets, using the more stringent redundancy definition. Zero redundancy is thus also not possible for
more general classes of sources such as sources with memory or with infinite alphabets.

1) Log-Blocklength Normalized Information Measures: We formulate several definitions and extend the
source coding theorems to these definitions. Let $Z_1^1, Z_1^2, \ldots$ represent a sequence of random variables over
a sequence of alphabets $\mathcal{Z}_1, \mathcal{Z}_2, \ldots$. (For sequence coding we would have $Z_1^n = (X_i)_{i=1}^n$, and $\mathcal{Z}_n = \mathcal{X}^n$
is an alphabet of sequences. For multiset coding we would have $Z_1^n = \{X_i\}_{i=1}^n$, and $\mathcal{Z}_n = \mathcal{K}(\mathcal{X}, n)$ is
an alphabet of types.) Define the log-blocklength normalized entropy rate as

$$\mathfrak{H}(\mathfrak{Z}) = \lim_{n \to \infty} \frac{H(Z_1^n)}{\log n}$$

when the limit exists. With conditioning on another random variable $\Theta$, define the log-blocklength
normalized conditional entropy rate as

$$\mathfrak{H}(\mathfrak{Z} \mid \Theta) = \lim_{n \to \infty} \frac{H(Z_1^n \mid \Theta)}{\log n}$$

when the limit exists. Similarly, define the log-blocklength normalized information rate as

$$\mathfrak{I}(\mathfrak{Z}; \Theta) = \lim_{n \to \infty} \frac{I(Z_1^n; \Theta)}{\log n}$$

when the limit exists. While these definitions parallel the standard definitions, none of the limits would
generally exist for sequence coding because the numerators grow linearly with n.

For each $n \in \mathbb{Z}^+$, let $\phi_n$ be a source code for random variable $Z_1^n$. For this sequence of source codes,
the average codeword lengths are

$$L_{\phi,n} = \sum_{z_1^n} p_{Z_1^n}(z_1^n)\, \ell(z_1^n),$$

where $\ell(\cdot)$ is the length of the codeword assigned to the source realization $z_1^n$. Shannon's fixed-to-variable
source coding theorem [2] establishes that there exists a sequence of source codes that satisfy the following
inequalities for all n:

$$H(Z_1^n) \le L_{\phi,n} \le H(Z_1^n) + 1.$$

Dividing through by log n yields

$$\frac{H(Z_1^n)}{\log n} \le \frac{L_{\phi,n}}{\log n} \le \frac{H(Z_1^n) + 1}{\log n}.$$

Taking the limit of large blocklength (n → ∞), we see that when the limits exist there is a sequence of
source codes that achieves $\mathfrak{H}(\mathfrak{Z})$.

2) Log-Blocklength Normalized Redundancy Measures: Define the redundancy of a source code, $r_{\phi,n}$,
as the excess of the average codeword length $L_{\phi,n}$ over the minimum $H(Z_1^n)$:

$$r_{\phi,n} = L_{\phi,n} - H(Z_1^n).$$

Finally, define the log-blocklength normalized redundancy of a sequence of source codes as

$$\mathcal{L} = \lim_{n \to \infty} \frac{r_{\phi,n}}{\log n} = \lim_{n \to \infty} \frac{L_{\phi,n} - H(Z_1^n)}{\log n} = \lim_{n \to \infty} \frac{L_{\phi,n}}{\log n} - \mathfrak{H}(\mathfrak{Z}).$$

By the manipulations of the source coding theorem that we had made previously, we know that there is a
sequence of codes with $\mathcal{L} = 0$. The code used to develop the upper bound in the source coding theorem,
however, requires that $p_{Z_1^n}(z_1^n)$ is known.

Now we define performance measures for source coding for a class of source distributions, rather than
just a single source distribution. The definitions parallel those of [43]. Suppose that the source distribution
is chosen from a class that is parameterized by $\theta \in T$. For each $\theta$, there is a conditional distribution

$$p(z_1^n \mid \theta) = \Pr[Z_1^n = z_1^n \mid \Theta = \theta].$$

The parameter is fixed but unknown when generating the source realization. Moreover, there may be a
distribution on this parameter, $p_\Theta(\theta)$. Let $\Phi_n$ be the set of all uniquely decipherable codes on $Z_1^n$. Then,
the average log-blocklength normalized redundancy of a code $\phi \in \Phi_n$ for the class of sources described
by $p_\Theta(\theta)$ is

$$\mathcal{L}_{\phi,n}(p_\Theta) = \int_T \frac{r_{\phi,n}(\theta)}{\log n}\, p_\Theta(\theta)\, d\theta,$$

where $r_{\phi,n}(\theta)$ is the redundancy of $\phi$ when the parameter value is $\theta$.
The minimum nth-order average log-blocklength normalized redundancy is

$$\mathcal{L}^*_n(p_\Theta) = \inf_{\phi \in \Phi_n} \mathcal{L}_{\phi,n}(p_\Theta).$$

Finally, the minimum average log-blocklength normalized redundancy is

$$\mathcal{L}^*(p_\Theta) = \lim_{n \to \infty} \mathcal{L}^*_n(p_\Theta).$$
If $\mathcal{L}^*(p_\Theta) = 0$, then a sequence of codes that achieves the limit is called weighted log-blocklength
normalized universal. Now let $\mathcal{T}$ be the set of all probability distributions defined on the alphabet $T$.
Then the nth-order maximin log-blocklength normalized redundancy of $\mathcal{T}$ is

$$\mathcal{L}^-_n = \sup_{q_\Theta \in \mathcal{T}} \mathcal{L}^*_n(q_\Theta).$$

If it exists, then the maximin log-blocklength normalized redundancy is

$$\mathcal{L}^- = \lim_{n \to \infty} \mathcal{L}^-_n.$$

If $\mathcal{L}^- = 0$, then a sequence of codes that achieves the limit is called maximin log-blocklength normalized
universal. The nth-order minimax log-blocklength normalized redundancy of $T$ is

$$\mathcal{L}^+_n = \inf_{\phi \in \Phi_n} \sup_{\theta \in T} \frac{r_{\phi,n}(\theta)}{\log n},$$

and the minimax log-blocklength normalized redundancy of $T$ is

$$\mathcal{L}^+ = \lim_{n \to \infty} \mathcal{L}^+_n.$$

If $\mathcal{L}^+ = 0$, then a sequence of codes that achieves the limit is called minimax log-blocklength normalized
universal.

3) Redundancy-Capacity Theorems: The senses of universality that we have defined obey an ordering 
relation. 

Theorem 5. The log-normalized redundancy quantities satisfy

$$\mathcal{L}^+_n \ge \mathcal{L}^-_n \ge \mathcal{L}^*_n(p_\Theta)$$

and

$$\mathcal{L}^+ \ge \mathcal{L}^- \ge \mathcal{L}^*(p_\Theta).$$

Proof: Minor modification of [43, Theorem 1]. ■
Armed with definitions and relations among several notions of log-blocklength normalized universality, 
we now study when it is possible to achieve universality. We give a theorem that gives a necessary and 
sufficient condition on the existence of weighted log-blocklength normalized universal codes. 

Theorem 6. The minimum nth-order average log-blocklength normalized redundancy is bounded as

$$\frac{I(Z_1^n; \Theta)}{\log n} \le \mathcal{L}^*_n(p_\Theta) \le \frac{I(Z_1^n; \Theta)}{\log n} + \frac{1}{\log n}.$$

A necessary and sufficient condition for the existence of weighted log-blocklength normalized universal
codes is that

$$\mathcal{L}^*(p_\Theta) = \lim_{n \to \infty} \mathcal{L}^*_n(p_\Theta) = \lim_{n \to \infty} \frac{I(Z_1^n; \Theta)}{\log n} = \mathfrak{I}(\mathfrak{Z}; \Theta) = 0.$$

Proof: Minor modification of [43, Theorem 2]. ■
Theorem 6 can be extended to conditions for minimax and maximin log-blocklength normalized
universality and can also be strengthened by suitable modification of theorems in [44], [45].

4) Class of Memoryless Multisets: Consider the class of memoryless, binary multisets. The parameter
$\theta$ is the Bernoulli trial parameter. Now suppose that there is a distribution over the parameter space,
$q_\Theta(\theta) \in \mathcal{T}$, that is uniform over [0, 1]. This gives a mixed source where all type classes are equiprobable,
as shown now.

Let the realizations of $\{X_i\}_{i=1}^n$ be expressed as $z \in \{0, \ldots, n\}$, the number of ones. Then

$$\Pr[\{X_i\}_{i=1}^n = z] = \int_0^1 q_\Theta(\theta) \Pr[\{X_i\}_{i=1}^n = z \mid \Theta = \theta]\, d\theta,$$

where $q_\Theta(\theta)$ simply equals one over the range of integration, and $\Pr[\{X_i\}_{i=1}^n = z \mid \Theta = \theta]$ is given by
a binomial distribution:

$$\Pr[\{X_i\}_{i=1}^n = z \mid \Theta = \theta] = \binom{n}{z} \theta^z (1 - \theta)^{n-z}.$$

So,

$$\Pr[\{X_i\}_{i=1}^n = z] = \int_0^1 \binom{n}{z} \theta^z (1 - \theta)^{n-z}\, d\theta = \binom{n}{z} B(z+1, n-z+1) = \frac{1}{n+1},$$

where the beta function $B(\cdot, \cdot)$ has been used. Thus, the result is that the type classes are equiprobable.
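
The equiprobability of the type classes can be verified in exact arithmetic. The sketch below uses the standard identity $B(a, b) = (a-1)!\,(b-1)!/(a+b-1)!$ for positive integers; the helper names and the choice n = 10 are assumptions.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(a, b):
    """Beta function B(a, b) for positive integers a and b, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def type_probability(n, z):
    """Pr[z ones among n draws] when the Bernoulli parameter is uniform on [0, 1]."""
    return comb(n, z) * beta_int(z + 1, n - z + 1)


n = 10
probs = [type_probability(n, z) for z in range(n + 1)]
print(probs)
print(all(q == Fraction(1, n + 1) for q in probs), sum(probs) == 1)
```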

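Since the types are equiprobable, the entropy $H(\{X_i\}_{i=1}^n)$ is just the logarithm of the number of
types:

$$H(\{X_i\}_{i=1}^n) = \log \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1} = \log(n+1).$$

The entropy conditioned on $\Theta = \theta$ is simply the entropy of a binomial random variable [38] and so the
conditional entropy is

$$\begin{aligned}
H(\{X_i\}_{i=1}^n \mid \Theta) &= \int H(\{X_i\}_{i=1}^n \mid \Theta = \theta)\, q_\Theta(\theta)\, d\theta = \int \Big[ \tfrac{1}{2} \log\big(2\pi e\, n\theta(1-\theta)\big) + \sum_{k \ge 1} a_k n^{-k} \Big] q_\Theta(\theta)\, d\theta \\
&= \tfrac{1}{2} \log n + \sum_{k \ge 1} n^{-k} \int a_k\, q_\Theta(\theta)\, d\theta + \int \tfrac{1}{2} \log\big(2\pi e\, \theta(1-\theta)\big)\, q_\Theta(\theta)\, d\theta \\
&= \tfrac{1}{2} \log n + o(\log n),
\end{aligned}$$

where the $a_k$s are known constants given in [38].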



February 1, 2008 



DRAFT 






Now the mutual information is $I(\{X_i\}_{i=1}^n; \Theta) = H(\{X_i\}_{i=1}^n) - H(\{X_i\}_{i=1}^n \mid \Theta)$, so the log-blocklength
normalized information rate is given by

$$\begin{aligned}
\mathfrak{I} &= \lim_{n \to \infty} \frac{I(\{X_i\}_{i=1}^n; \Theta)}{\log n} = \lim_{n \to \infty} \frac{\log(n+1) - \tfrac{1}{2} \log n - o(\log n)}{\log n} \\
&= \lim_{n \to \infty} \frac{\log(n+1) - \tfrac{1}{2} \log n}{\log n} = \lim_{n \to \infty} \frac{\log(n+1)}{\log n} - \frac{1}{2} = \frac{1}{2}.
\end{aligned}$$

Since this is greater than 0, we have shown that the weakest form of universality is not possible, by
Theorem 6. Thus by Theorem 5, stronger forms of universality are not possible either. Since the class
of binary memoryless multisets is a subset of more general source classes such as the memoryless, Markov, and
stationary ergodic classes, universal source coding over these source classes is not possible either.

We can calculate the weighted redundancy for classes of memoryless sources with larger alphabet sizes.
Using the $p_\Theta(\theta)$ that yields equiprobable multisets and the conditional entropy given by the entropy of
a multinomial random variable [38], we are interested in

$$\lim_{n \to \infty} \frac{\log \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1} - \tfrac{1}{2}(|\mathcal{X}| - 1) \log(\kappa n) - o(1)}{\log n} = \frac{|\mathcal{X}| - 1}{2},$$

where $\kappa$ is a known constant. As we can see, this redundancy grows without bound as the alphabet
size increases. Perhaps unsurprisingly, this redundancy expression is reminiscent of the unnormalized
redundancy expression for i.i.d. sequences [46]:

$$\frac{|\mathcal{X}| - 1}{2} \log \frac{n}{2\pi} + \log \frac{\Gamma^{|\mathcal{X}|}(1/2)}{\Gamma(|\mathcal{X}|/2)} + o_{|\mathcal{X}|}(1).$$

We see that the richness of a class of sources as sequences is the same as the richness of that class as 
multisets. 

Let us also comment that in the deterministic case of individual multisets, rather than the probabilistic 
classes of sources that we have been considering, the same non-achievability result applies. This follows 
from the arguments summarized in [47]. 

V. Lossy Coding 

In the two previous sections, we have considered lossless representation of multiset sources with
discrete alphabets. Now in this section and Section VI, we look at lossy coding with both discrete and
continuous alphabets.

A. Large-Size Multiset Asymptotics

1) Multiset Mean Squared Error: Assume that the source alphabet $\mathcal{X}$ is a subset of the real numbers.
In cases of interest (for example, when every $\{X_i\}_{i=1}^n$ has a probability density) multisets drawn from
these alphabets are almost surely sets. (Simply, ties have zero probability.) Thus, the type is a list of
n values that each occurred once. This list is conveniently represented with order statistics. Recall that
$X_{(r:n)}$ denotes the rth order statistic from a block of n, which is the rth-largest of the set $\{X_i\}_{i=1}^n$.

Define a word distortion measure as

\[ \rho_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n \big(x_{(i:n)} - y_{(i:n)}\big)^2, \tag{8} \]

and an associated fidelity criterion as

\[ F_1 = \{\rho_n(x_1^n, y_1^n),\ n = 1, 2, \ldots\}. \tag{9} \]

Although not a single-letter fidelity criterion, it is single-letter mean square error on the block of order 
statistics. 

If $\{X_i\}_{i=1}^n$ is reconstructed by $\{Y_i\}_{i=1}^n$, the incurred distortion is

\[ D_n = \frac{1}{n}\sum_{i=1}^n E\big[(X_{(i:n)} - Y_{(i:n)})^2\big]. \]

If we use no rate, then the best choice for the reconstruction is simply $y_{(i:n)} = E[X_{(i:n)}]$, $i = 1, 2, \ldots, n$,
and the average incurred distortion reduces to

\[ D_n(R = 0) = \frac{1}{n}\sum_{i=1}^n \operatorname{var}\big(X_{(i:n)}\big). \]

Before proceeding with a general proof that $D(R = 0) = \lim_{n\to\infty} D_n(R = 0) = 0$, we give some
examples. For multisets with elements drawn i.i.d. from the uniform distribution with support $[-\sqrt{3}, \sqrt{3}]$,
from the Gaussian distribution with mean zero and variance one, and from the exponential distribution
with mean one, Figure 2 shows the nth-order distortion-rate function. This is the average variance of the
order statistics. It can be shown that all of these bounded, monotonically decreasing sequences of real
numbers, $\{D_n(0)\}$, have limit 0 [25]. In fact all of these sequences decay as $O(1/n)$. Hence for zero
rate, there is zero distortion incurred.
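
The zero-rate distortion for the three example sources can be estimated by simulation. The following is a minimal Monte Carlo sketch, not the computation used for the paper's figure; the function name `zero_rate_distortion`, the `SOURCES` table, the trial count, and the seed are all illustrative assumptions.

```python
import random
import statistics


def zero_rate_distortion(sampler, n, trials=2000, seed=0):
    """Monte Carlo estimate of D_n(0) = (1/n) * sum_i var(X_(i:n))."""
    rng = random.Random(seed)
    columns = [[] for _ in range(n)]  # columns[i] collects the (i+1)-th sorted value
    for _ in range(trials):
        block = sorted(sampler(rng) for _ in range(n))
        for i, value in enumerate(block):
            columns[i].append(value)
    return sum(statistics.pvariance(col) for col in columns) / n


# Unit-variance parents matching the three examples in the text.
SOURCES = {
    "uniform": lambda rng: rng.uniform(-3 ** 0.5, 3 ** 0.5),
    "gaussian": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
}

if __name__ == "__main__":
    for name, sampler in SOURCES.items():
        estimates = [zero_rate_distortion(sampler, n) for n in (1, 2, 4, 8, 16, 32)]
        print(name, [round(d, 3) for d in estimates])
```

For the uniform parent the estimates can be compared with the exact value 2/(n + 1), which follows from the Beta(i, n - i + 1) law of uniform order statistics.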

The result that the rate-distortion function is the zero-zero point, along with the distortion decay as
a function of block size being $O(1/n)$, also holds for a large class of other sources. If we assume
that the cumulative distribution function of the source is always differentiable (i.e., the density function
$p_X(x)$ exists) and that $p_X(x) > 0$ for all $x \in \mathbb{R}$, then the same result holds. This follows from the
asymptotic fixed variance normality of $\sqrt{n}$-normalized central order statistics [48, Corollary 21.5]. Notice
that although this class of sources is very large, two of our examples were not members. Thus, we
formulate an even more widely-applicable theorem on zero rate-zero distortion, though we no longer
have a characterization of the decay rate.

Fig. 2. Distortion at zero rate $D_n(0)$ as function of multiset size n for several sources.

The general theorem will be based on the quantile function of the i.i.d. parent process; this is the
generalized inverse of the cumulative distribution function, $F_X(x)$,

\[ Q(w) = F_X^{-1}(w) = \inf\{x : F_X(x) \ge w\}. \]

The empirical quantile function, defined in terms of order statistics, is

\[ Q_n(w) = X_{(\lfloor wn \rfloor + 1:n)} = F_n^{-1}(w), \]

where $F_n(\cdot)$ is the empirical distribution function. The quantile function $Q(\cdot)$ is continuous if and only
if the distribution function has no flat portions in the interior. The main step of the proof will be a
Glivenko-Cantelli like theorem for empirical quantile functions [49].
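
The empirical quantile function is easy to compute from a sorted block. The sketch below assumes order statistics are indexed in ascending order (the convention implied by the generalized inverse of the empirical distribution function); `empirical_quantile` is a hypothetical helper name.

```python
import math
import random


def empirical_quantile(samples, w):
    """Q_n(w) = X_(floor(w*n)+1 : n) for 0 < w < 1, assuming ascending order statistics."""
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    ordered = sorted(samples)
    n = len(ordered)
    # Zero-based index floor(w*n) is the (floor(w*n)+1)-th smallest value.
    return ordered[math.floor(w * n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # For a Uniform(0, 1) parent the true quantile function is Q(w) = w.
    for n in (10, 100, 1000, 10000):
        block = [rng.random() for _ in range(n)]
        gap = max(abs(empirical_quantile(block, k / 20) - k / 20) for k in range(1, 20))
        print(n, round(gap, 4))
```

The printed gap is an unweighted maximum over a grid of w values; it is meant only to illustrate the uniform convergence that the weighted statistic in the proof controls.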

Lemma 1. Let the letters to be coded, $X_1, X_2, \ldots, X_n$, be generated in an i.i.d. fashion according to
$F_X(x)$ with associated quantile function $Q(w)$. Let $X_1$ satisfy

\[ E\big[|\min(X_1, 0)|^{1/\nu_1}\big] < \infty \quad\text{and}\quad E\big[(\max(X_1, 0))^{1/\nu_2}\big] < \infty \tag{10} \]

for some $\nu_1 > 0$ and $\nu_2 > 0$ and have continuous quantile function $Q(w)$. Then the sequence of
distortion-rate values for the coding of size-n sets drawn from the parent distribution satisfies

\[ \lim_{n\to\infty} D_n(R = 0) = 0. \]
Proof: For any nonnegative function u defined on (0, 1), define a weighted Kolmogorov-Smirnov
like statistic

\[ S_n(u) = \sup_{0 < w < 1} u(w)\,|Q_n(w) - Q(w)|. \]

For each $\nu_1 > 0$, $\nu_2 > 0$, and $w \in (0, 1)$, define the weight function

\[ u_{\nu_1,\nu_2}(w) = w^{\nu_1}(1-w)^{\nu_2}. \]

Assume that Q is continuous, choose any $\nu_1 > 0$ and $\nu_2 > 0$, and define

\[ \gamma = \limsup_{n\to\infty} S_n(u_{\nu_1,\nu_2}). \]

Then by a result of Mason [49], $\gamma = 0$ with probability 1 when (10) holds. Our assumptions on the
parent process meet this condition, so $\gamma = 0$ with probability 1. This implies that

\[ \limsup_{n\to\infty} \big|X_{(\lfloor wn \rfloor + 1:n)} - Q(w)\big| \le 0 \quad\text{for all } w \in (0, 1) \text{ w.p.1}, \tag{11} \]

and since the absolute value is nonnegative, the inequality holds with equality. According to (11), for
sufficiently large n, each order statistic takes a fixed value with probability 1. The bounded moment
condition on the parent process, (10), implies a bounded moment condition on the order statistics. Almost
sure convergence to a fixed quantity, together with the bounded moment condition on the events of
probability zero, implies convergence in second moment of all order statistics. This convergence in second
moment to a deterministic distribution implies that the variance of each order statistic is zero, and thus
the average variance is zero. ■
We have established that asymptotically in n, the point (R = 0, D = 0) is achievable, which leads to 
the following theorem. 

Theorem 7. Under fidelity criterion $F_1$, $R(D) = 0$ for an i.i.d. source that meets the bounded moment
condition (10) and has continuous quantile function.

Proof: By the nonnegativity of the distortion function, $D(R) \ge 0$. By Lemma 1, $D(0) \le 0$, so
$D(0) = 0$. Since $D(R)$ is a non-increasing function, $D(R) = 0$, and so $R(D) = 0$. ■
Due to the generality of the Glivenko-Cantelli like theorem that we used, the result will stand for a 
very large class of distortion measures. One only needs to ensure that the set of outcomes of probability 
zero is not problematic. 
2) Arbitrary Single-Letter Distortions: While Theorem 7 applies to a large class of real-valued parent
distributions, it depends on the multiset squared error distortion measure (8) to make convergence of
moments of the order statistics relevant. With arbitrary single-letter distortion measures we obtain a
result analogous to Theorem 2 in that it shows that an $O(\log n)$ rate is sufficient for coding n letters
accurately.

Consider a source and single-letter distortion function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ such that the rate distortion
function (for encoding as a sequence) is $R_X(D)$. For coding this source without regard to order, define
a word distortion measure

\[ \rho_n(x_1^n, y_1^n) = \min_{\pi} \frac{1}{n}\sum_{i=1}^n d\big(x_i, y_{\pi(i)}\big), \tag{12} \]

where $\pi$ is a permutation on $\{1, 2, \ldots, n\}$, and an associated fidelity criterion

\[ F_1 = \{\rho_n(x_1^n, y_1^n),\ n = 1, 2, \ldots\}. \tag{13} \]

(We have re-used the notation from (8)-(9) since the per-letter MSE on order statistics defined there is
a special case.) Denote the minimum (total) rate for encoding $\{X_i\}_{i=1}^n$ with $E[\rho_n(X_1^n, \hat{X}_1^n)] \le D$ by
$R_{\{X_i\}_{i=1}^n}(D)$. Then the following theorem bounds the growth of $R_{\{X_i\}_{i=1}^n}(D)$ as a function of n.

Theorem 8. If $R_X(D)$ is finite, then for any $\epsilon > 0$,

\[ R_{\{X_i\}_{i=1}^n}(D + \epsilon) = O(\log n). \]

Proof: Let D be such that $R = R_X(D)$ is finite and let $\epsilon > 0$. The achievability of $R_X(D)$ means
there is a sequence of dimension-n quantizers with $2^{nR}$ codewords such that $\lim_{n\to\infty} E[d(X_1^n, \hat{X}_1^n)] \le D$.
Thus, there exists a finite N such that a dimension-N quantizer with $2^{NR}$ codewords achieves distortion at
most $D + \epsilon$. Applying this quantizer to blocks of length N of the source creates a finite-alphabet source
that can be communicated as a multiset with $O(\log n)$ bits (Theorem 2). The distortion with respect to
(12) does not exceed $D + \epsilon$. In fact, a somewhat more stringent distortion measure is held to at most
$D + \epsilon$; this measure is of the form (12) with permutations $\pi$ limited to rearrangements that keep blocks
of length N intact. ■
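
To see why the total rate grows only logarithmically, one can count the multisets of quantizer indices directly. The sketch below is an illustration under assumed values (block dimension N = 4 and rate R = 1 bit per letter, so a 16-word codebook); the function names are hypothetical and the count is the standard stars-and-bars number of types.

```python
import math


def multiset_bits(num_blocks, codebook_size):
    """log2 of the number of multisets of num_blocks items drawn from codebook_size codewords."""
    count = math.comb(num_blocks + codebook_size - 1, codebook_size - 1)
    return math.log2(count)


def total_rate(n, N, R):
    """Total bits needed to describe the multiset of the n/N quantizer indices,
    each index coming from a codebook with 2^(N*R) codewords."""
    codebook_size = round(2 ** (N * R))
    return multiset_bits(n // N, codebook_size)


if __name__ == "__main__":
    N, R = 4, 1.0  # assumed values, for illustration only
    for n in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
        bits = total_rate(n, N, R)
        print(n, round(bits, 1), round(bits / math.log2(n), 2))
```

The last printed column, total bits divided by log2 n, stays bounded by the codebook size minus one, while the rate per source letter vanishes.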

B. Coding of Finite-Size Multisets for Discrete-Alphabet Sources 

In Sections III, IV, and V-A we allowed the multiset size to go to infinity. Such source coding incurs
infinite delay, and in the case of Section V-A the source coding problem is trivialized. If we are concerned
with delay, we would want to code short blocks at a time. In this section and the subsequent section,
we investigate bounds on coding when we restrict the multiset size to be fixed and finite. Then our
asymptotic results are based on increasing the number of independent realizations of these finite-sized
multisets.

If we are concerned about lossless representation of each fixed multiset, then the rate requirement
is simply lower bounded by the entropy. As we had discussed in Section III, if the multiset elements
are drawn i.i.d., then the entropy is the same as the entropy of a multinomial random variable. For
example, Bernoulli(p) multisets of size K have $H(\{X_i\}_{i=1}^K) \approx \frac{1}{2}\log_2(2\pi e K p(1-p))$. The entropy
lower bound assumes that we require the fixed-size multisets to be uniquely decipherable. If we insist on
a slightly weaker requirement, where these multisets might become permuted, we can require multiset
decipherability of the multiset representations. Notwithstanding the falsity of [5, Conjecture 3] (shown
in [50]), the gains below entropy are minimal, and so we do not pursue this weaker requirement further.

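Rather than lossless coding, one might be interested in lossy coding of fixed-size multisets from discrete
alphabets. We define a fidelity criterion for K-size multisets

\[ F_2 = \Big\{\frac{K}{n}\sum_{i=1}^{n/K} d_K\big(x_{iK-K+1}^{iK}, y_{iK-K+1}^{iK}\big),\ n = K, 2K, \ldots\Big\}. \]
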
The word distortion measure, $d_K$, used to define the fidelity criterion takes value zero when x and y are
in the same type class and one otherwise. One can also express the word distortion measure in group
theoretic terms using permutation groups, if desired. This notion of fidelity casts the problem into a
frequency of error framework on the types. Assuming that the multisets to be coded are independent and
identically distributed, this is simply an i.i.d. discrete (finite or countable) source with error frequency
distortion, so the reverse waterfilling solution of Erokhin [51] applies. The rate-distortion function is
given parametrically as

\[ D_\theta = 1 - S_\theta + \theta(N_\theta - 1), \]

\[ R_\theta = -\sum_{k:\,p(k) > \theta} p(k)\log p(k) + (1 - D_\theta)\log(1 - D_\theta) + (N_\theta - 1)\,\theta\log\theta, \]

where $N_\theta$ is the number of types whose probability is greater than $\theta$ and $S_\theta$ is the sum of the probabilities
of these $N_\theta$ types. The parameter $\theta$ goes from 0 to $p(\xi)$ as D goes from 0 to $D_{\max} = 1 - p(\nu)$; the
most probable type is denoted $\nu$ and the second most probable type is denoted $\xi$. If the letters within
the multisets are also i.i.d., the probability values needed for the reverse waterfilling characterization are
computed using the multinomial distribution.

Only the most probable source types are used in the representation alphabet. It is known that the
probability of type class k drawn i.i.d. from a finite-alphabet parent $p_X$ is bounded as follows [34]:

\[ \frac{1}{|\mathcal{K}(\mathcal{X}, n)|}\, 2^{-nD(p_k \| p_X)} \le \Pr[k] \le 2^{-nD(p_k \| p_X)}, \]

where $p_k$ is a probability measure derived by normalizing the type k. The multiset types used in the
representation alphabet are given by the type classes in the typical set

\[ T_{\epsilon(\theta)} = \{k : D(p_k \| p_X) \le \epsilon(\theta)\}. \]

Since multiset sources are successively refinable under error frequency distortion [52], scalable coding
would involve adding types into the representation alphabet.
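
The parametric reverse waterfilling characterization can be traced numerically. The sketch below assumes the parametric form reads D = 1 - S + theta (N - 1) and R = -sum p log p (over types with p > theta) + (1 - D) log(1 - D) + (N - 1) theta log theta, with N and S the number and total probability of types exceeding theta. Binary letters are assumed so that type probabilities are binomial, and the helper names are hypothetical.

```python
import math


def xlog2x(x):
    """x * log2(x), with the usual convention 0 * log 0 = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def erokhin_point(probs, theta):
    """One (D, R) point, R in bits, for error-frequency distortion.
    theta should lie between 0 and the second-largest probability."""
    kept = [p for p in probs if p > theta]
    n_theta = len(kept)
    s_theta = sum(kept)
    distortion = 1.0 - s_theta + theta * (n_theta - 1)
    rate = -sum(xlog2x(p) for p in kept) + xlog2x(1.0 - distortion) + (n_theta - 1) * xlog2x(theta)
    return distortion, rate


def binomial_type_probs(K, p):
    """Type probabilities for size-K multisets of i.i.d. Bernoulli(p) letters."""
    return [math.comb(K, z) * p ** z * (1 - p) ** (K - z) for z in range(K + 1)]


if __name__ == "__main__":
    probs = binomial_type_probs(2, 0.5)  # [1/4, 1/2, 1/4]
    for theta in (0.0, 0.05, 0.1, 0.2, 0.25):
        d, r = erokhin_point(probs, theta)
        print(theta, round(d, 4), round(r, 4))
```

At theta = 0 the point is (0, H), the entropy of the type; at theta equal to the second-largest probability only the most probable type survives and the rate is zero.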

In addition to $F_2$, we can define other fidelity criteria that reduce the multiset rate-distortion problem to
well-known discrete memoryless source rate-distortion problems. As a simple example, consider multisets
of length K = 2 and consisting of i.i.d. equiprobable binary elements. Then there are three letters in
the alphabet of types: {0, 0}, {0, 1}, and {1, 1}, which can be represented by their Hamming weights,
{0, 1, 2}. The probabilities of these three letters are {1/4, 1/2, 1/4}. Define the word distortion function
using the Hamming weight, $w_H(\cdot)$:

\[ \delta(x_1^2, y_1^2) = \big|w_H(x_1^2) - w_H(y_1^2)\big|. \]

The fidelity criterion is

\[ F_3 = \Big\{\frac{2}{n}\sum_{i=1}^{n/2} \delta\big(x_{2i-1}^{2i}, y_{2i-1}^{2i}\big),\ n = 2, 4, \ldots\Big\}. \]

This is a single-letter fidelity criterion on the Hamming weights and is in fact the well-studied problem
known as the Gerrish problem [53, Problem 2.8]. One can easily generate equivalences to other known
problems as well.

C. Coding of Finite-Size Multisets for Continuous-Alphabet Sources 

Now turning our attention to fixed-size multisets with continuous alphabets, we first see what simple 
quantization schemes can do, then develop some high-rate quantization theory results and finally compute 
some rate distortion theory bounds. 

1) Low-Rate Low-Dimension Quantization for Fixed-Size Multisets: Using the previously defined word
distortion (8) on blocks of length K, a new fidelity criterion is

\[ F_4 = \Big\{\frac{K}{n}\sum_{i=1}^{n/K} \rho_K\big(x_{iK-K+1}^{iK}, y_{iK-K+1}^{iK}\big),\ n = K, 2K, \ldots\Big\}. \]

This is average MSE on the block of order statistics. Notice that the fidelity criterion is defined only for
words that have lengths that are multiples of the block size K.

For low rates and coding one set at a time, we can find optimal MSE quantizers through the Lloyd-Max
optimization procedure [54]. The quantizers generated in this way are easy to implement for practical
source coding, and they also provide an upper bound on the rate-distortion function. Designing the
quantizers requires knowledge of the distributions of order statistics, which can be derived from the
parent distribution [35]. For $X_1, X_2, \ldots, X_K$ that are drawn i.i.d. according to the cumulative distribution
function $F_X(x)$, the marginal cumulative distribution function of $X_{(r:K)}$ is given in closed form by

\[ F_{(r:K)}(x) = \sum_{i=r}^{K} \binom{K}{i} F_X^i(x)\,[1 - F_X(x)]^{K-i} = I_{F_X(x)}(r, K - r + 1), \]

where $I_p(a, b)$ is the incomplete beta function. Subject to the existence of the parent density $f_X(x)$, the
marginal density of $X_{(r:K)}$ is

\[ f_{(r:K)}(x) = \frac{1}{B(r, K - r + 1)}\,[1 - F_X(x)]^{K-r}\, F_X^{r-1}(x)\, f_X(x), \tag{14} \]

where $B(a, b)$ is the beta function. The joint density of all K order statistics is

\[ f_{(1:K),\ldots,(K:K)}(x_1, \ldots, x_K) = \begin{cases} K!\,\prod_{i=1}^{K} f_X(x_i), & x_1^K \in \mathfrak{R} \\ 0, & \text{else.} \end{cases} \tag{15} \]

The region of support, $\mathfrak{R} = \{x_1^K : x_1 < \cdots < x_K\}$, is a convex cone that occupies (1/K!)th of $\mathbb{R}^K$.
The order statistics also have the Markov property [35] with transition probability

\[ f_{X_{(r+1:K)} \mid X_{(r:K)} = x}(y) = (K - r)\left[\frac{1 - F_X(y)}{1 - F_X(x)}\right]^{K-r-1} \frac{f_X(y)}{1 - F_X(x)}, \quad \text{for } y > x. \tag{16} \]
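
The closed-form marginal distribution of an order statistic can be sanity-checked against simulation. This sketch assumes a Uniform(0, 1) parent (so the parent CDF value at x is x itself) and reads the binomial-sum formula as the probability that at least r of the K samples fall at or below x, which corresponds to indexing order statistics in ascending order; both function names are hypothetical.

```python
import math
import random


def order_statistic_cdf(parent_cdf_value, r, K):
    """P(X_(r:K) <= x) as a binomial tail, given p = F_X(x)."""
    p = parent_cdf_value
    return sum(math.comb(K, i) * p ** i * (1 - p) ** (K - i) for i in range(r, K + 1))


def simulated_order_statistic_cdf(x, r, K, trials=20000, seed=1):
    """Monte Carlo estimate of the same probability for a Uniform(0, 1) parent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        block = sorted(rng.random() for _ in range(K))
        if block[r - 1] <= x:  # r-th smallest value, 1-based r
            hits += 1
    return hits / trials


if __name__ == "__main__":
    K = 5
    for r in range(1, K + 1):
        exact = order_statistic_cdf(0.4, r, K)
        estimate = simulated_order_statistic_cdf(0.4, r, K)
        print(r, round(exact, 4), round(estimate, 4))
```

For another parent distribution, only the value passed as `parent_cdf_value` and the sampler inside the simulation would change.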

In a standard quantization setup, the sorting filter dTJ would be applied first to generate the transform 
coefficients and then further source coding would be performed. Since sorting quantized numbers is easier 
than sorting real-valued numbers, we would prefer to be able to interchange the operations. Based on the 
form of the joint distribution of order statistics, (15), we can formulate a statement about when sorting
and quantization can be interchanged without loss of optimality. If the order statistics are to be quantized 
individually using scalar quantization, then interchange without loss can be made in all cases [55]. Scalar 
quantization, however, does not take advantage of the Markovian dependence among elements to be 
coded. We consider coding the entire set together, referring to K as the dimension of the order statistic 
vector quantizer. 

If the representation points for an MSE-optimal (R rate, K dimension) order statistic quantizer are
the intersection of $\mathfrak{R}$ with the representation points for an MSE-optimal ($R + \log K!$, K) quantizer
for the unordered variates, then we can interchange sorting and quantization without loss of optimality.
This condition can be interpreted as a requirement of permutation polyhedral symmetry on the quantizer
of the unordered variates. This form of symmetry requires that there are corresponding representation
points of the unordered variate quantizer in each of the K! convex cones that partition $\mathbb{R}^K$ on the basis
of permutation. The polyhedron with vertices that are corresponding points in each of the K! convex
cones is a permutation polyhedron. In fact, the distortion performance of the MSE-optimal (R, K) order
statistic quantizer is equal to the distortion performance of the best ($R + \log K!$, K) unordered quantizer
constrained to have the required permutation symmetry. An example where the symmetry condition is
met is for the standard bivariate Gaussian distribution shown in Figure 3.

Fig. 3. Quantization for bivariate standard Gaussian order statistics. Optimal one-bit quantizer (white) achieves ($R = 1$, $D = (2\pi - 4)/\pi$).
Optimal two-bit quantizer (black) for unordered variates achieves ($R = 2$, $D = (2\pi - 4)/\pi$). Since representation
points for order statistic quantizer are the intersection of the cone (shaded) and the representation points for the unordered
quantizer, the distortion performance is the same.
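
The distortion values quoted for the bivariate Gaussian example can be reproduced approximately by simulation. The sketch below is an assumption-laden illustration, not the paper's quantizer design: it uses a one-bit quantizer on the sorted pair that splits on the sign of the sum, a two-bit quantizer on the unsorted pair that splits on the signs of the sum and the difference, and empirical cell centroids as representation points.

```python
import math
import random


def centroid_distortion(points, cell_of):
    """Squared error (summed over both coordinates, averaged over points)
    when every cell is reproduced by its empirical centroid."""
    cells = {}
    for pt in points:
        cells.setdefault(cell_of(pt), []).append(pt)
    total = 0.0
    centroids = {}
    for key, members in cells.items():
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        centroids[key] = (cx, cy)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total / len(points), centroids


rng = random.Random(7)
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50000)]
ordered = [(min(a, b), max(a, b)) for a, b in pairs]

# One bit on the order statistics: which side of the line x1 + x2 = 0.
d_sorted, c_sorted = centroid_distortion(ordered, lambda p: p[0] + p[1] >= 0)
# Two bits on the unordered pair: signs of the sum and of the difference.
d_unsorted, c_unsorted = centroid_distortion(
    pairs, lambda p: (p[0] + p[1] >= 0, p[1] - p[0] >= 0)
)
target = 2 - 4 / math.pi  # equals (2*pi - 4)/pi

print(round(d_sorted, 3), round(d_unsorted, 3), round(target, 3))
```

Both estimates should land near $(2\pi - 4)/\pi$, and the two centroids of the sorted quantizer coincide (up to sampling noise) with the two centroids of the unsorted quantizer that lie in the cone $x_1 < x_2$.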

2) High-Rate Quantization Theory for Fixed-Size Multisets: Based on the basic distributional properties
of order statistics, (14)-(16), the differential entropies of order statistics can be derived. The individual
marginal differential entropies are

\[ h(X_{(r:K)}) = -\int f_{(r:K)}(x)\log f_{(r:K)}(x)\,dx, \tag{17} \]

where no particular simplification is possible unless the parent distribution is specified. The average
marginal differential entropy, however, can be expressed in terms of the differential entropy of the parent
distribution and a constant that depends only on K [56]:

\[ \bar{h}(X_{(1:K)}, \ldots, X_{(K:K)}) = \frac{1}{K}\sum_{i=1}^{K} h(X_{(i:K)}) = h(X_1) - \log K - \frac{1}{K}\sum_{i=1}^{K}\log\binom{K-1}{i-1} + \frac{K-1}{2}. \tag{18} \]

The subtractive constant is positive and increasing in K, and not dependent on the parent distribution.



The individual conditional differential entropies, as derived in [25], are

\[
\begin{aligned}
h(X_{(r+1:K)} \mid X_{(r:K)}) ={}& -\log(K - r) - N_h(K) + N_h(K - r) + 1 - \frac{1}{K - r} \\
& - \frac{K!}{\Gamma(K - r)\Gamma(r)} \int_{-\infty}^{\infty}\int_{x}^{\infty} f_X(y)\log f_X(y)\,[1 - F_X(y)]^{K-r-1}\,dy\; F_X^{r-1}(x)\, f_X(x)\,dx,
\end{aligned}
\tag{19}
\]

where $\Gamma(\cdot)$ is the gamma function and $N_h(k) = \sum_{m=1}^{k} 1/m$ is the harmonic number. As in the individual
marginal case, further simplification of this expression requires the parent distribution to be specified.
Again, as in the marginal case, the total conditional differential entropy can be expressed in terms of the
parent differential entropy and a constant that depends only on K. Due to Markovianity, the sum of the
individual conditional differential entropies is in fact the joint differential entropy:

\[ h(X_{(1:K)}, \ldots, X_{(K:K)}) = h(X_{(1:K)}) + \sum_{i=1}^{K-1} h(X_{(i+1:K)} \mid X_{(i:K)}) = K h(X_1) - \log K!. \tag{20} \]

Notice that the analogous statement was a lower bound in the discrete alphabet case; equality holds
in the continuous case since there are no ties.

High-rate quantization results follow easily from the differential entropy calculations. To develop
results, we introduce four quantization schemes in turn, measuring performance under fidelity criterion
$F_4$. In particular, we sequentially introduce a shape advantage, a memory advantage, and a space-filling
advantage as in [57].⁴ As a baseline, take the naive scheme of direct uniform scalar quantization
of the arbitrarily-ordered sequence with quantization step size $\epsilon$. The average rate and distortion per
source symbol of the naive scheme are $R_1 = h(X_1) - \log\epsilon$ and $D_1 = \epsilon^2/12$. Now instead uniformly
scalar quantize the deterministically-ordered sequence (the order statistics). This changes the shape of
the marginal distributions that we are quantizing, and thus we get a shape advantage. The average rate
per source symbol for this scheme is

\[ R_2 = \bar{h}(X_{(1:K)}, \ldots, X_{(K:K)}) - \log\epsilon = R_1 - \log K - \frac{1}{K}\sum_{i=1}^{K}\log\binom{K-1}{i-1} + \frac{K-1}{2}. \tag{21} \]

The distortion is the same as the naive scheme, $D_2 = D_1$. As a third scheme, scalar quantize the order
statistics sequentially, using the previous order statistic as a form of side information. Even though the
encoding requires waiting for the entire block so as to sort, decoding can proceed symbol-by-symbol,
reducing delay. We assume that the previous order statistics are known exactly to both the encoder and
decoder. Since the order statistics form a Markov chain, this single-letter sequential transmission exploits
all available memory advantage. The rate for this scheme is

\[ R_3 = \frac{1}{K} h(X_{(1:K)}, \ldots, X_{(K:K)}) - \log\epsilon = R_1 - \frac{1}{K}\log K!. \tag{22} \]

Again, $D_3 = D_1$. Finally, the fourth scheme would vector quantize the entire sequence of order statistics
collectively. Since we have exploited all shape and memory advantages, the only thing we can gain is
space-filling gain. The rate is the same as the third scheme, $R_4 = R_3$; however, the distortion is less.
This distortion reduction is a function of K, is related to the best packing of polytopes, and is not
known in closed form for most values of K; see [57, Table I] and more recent work on packings. We
denote the distortion as $D_4 = D_1/G(K)$, where $G(K)$ is a function greater than unity. The performance
improvements of these schemes are summarized in Table I. Notice that all values in Table I depend only
on the multiset length K and not on the parent distribution.

⁴ Note that vector quantizer advantages are discussed in terms of distortion for fixed rate in [57], but we present some of these
advantages in terms of rate for fixed distortion.

TABLE I
Comparison between the naive scalar quantization (Scheme 1) and several other quantization schemes.
The symbols (s), (m), and (f) denote shape, memory, and space-filling advantages.

Scheme              Rate savings ($R_1 - R_i$)                                                   Distortion ratio ($D_i/D_1$)
Scheme 1            $0$                                                                          $1$
Scheme 2 (s)        $\log K + \frac{1}{K}\sum_{i=1}^{K}\log\binom{K-1}{i-1} - \frac{K-1}{2}$     $1$
Scheme 3 (s,m)      $(\log K!)/K$                                                                $1$
Scheme 4 (s,m,f)    $(\log K!)/K$                                                                $1/G(K)$

We have introduced several quantization schemes and calculated their performance in the high-rate 
limit. It was seen that taking the fidelity criterion into account when designing the source coder resulted 
in rate savings that did not depend on the parent distribution. These rate savings can be quite significant 
for large blocklengths K. 
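
The distribution-free rate savings can be tabulated for a few block sizes. The sketch below assumes the Table I entries are log K + (1/K) sum_i log C(K-1, i-1) - (K-1)/2 for the shape advantage and (log K!)/K once memory is also exploited, and it assumes natural logarithms throughout so that the additive constant (K-1)/2 is in nats; the function names are hypothetical.

```python
import math


def scheme2_savings(K):
    """Per-symbol rate savings of Scheme 2 over Scheme 1, in nats (shape advantage only)."""
    binomial_term = sum(math.log(math.comb(K - 1, i - 1)) for i in range(1, K + 1)) / K
    return math.log(K) + binomial_term - (K - 1) / 2


def scheme3_savings(K):
    """Per-symbol rate savings of Schemes 3 and 4 over Scheme 1, in nats: (log K!)/K."""
    return math.lgamma(K + 1) / K


if __name__ == "__main__":
    for K in (1, 2, 3, 4, 8, 16, 64):
        print(K, round(scheme2_savings(K), 4), round(scheme3_savings(K), 4))
```

Both columns grow with K, and exploiting memory (Scheme 3) saves more than the shape advantage alone for every K above one.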

3) Rate Distortion for Fixed-Size Multisets: It is quite difficult to obtain the full rate-distortion function
for the $F_4$ fidelity criterion; however, upper and lower bounds may be quite close to each other for
particular source distributions. As an example, consider the rate-distortion function for the independent
bivariate standard Gaussian distribution that was considered in Figure 3. The rate-distortion function
under $F_4$ is equivalent to the rate-distortion function for the order statistics under the MSE fidelity
criterion, as shown. For clarity of expression, let $X = (X_i)_{i=1}^K$ and $\hat{X} = (Y_i)_{i=1}^K$ in the unsorted domain
and $Z = \{X_i\}_{i=1}^K$ and $\hat{Z} = \{Y_i\}_{i=1}^K$ in the sorted domain. Clearly, the fidelity constraint is naturally
expressed in the Z domain. The affirmatively answered question is whether the mutual information in
the rate distortion optimization can be switched from $I(X; \hat{Z})$ to $I(Z; \hat{Z})$:



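\[
\begin{aligned}
I(X; \hat{Z}) &= h(X) + h(\hat{Z}) - h(X, \hat{Z}) \\
&\overset{(a)}{=} h(Z) + h(J) + h(\hat{Z}) - h(X, \hat{Z}) \\
&= h(Z) + h(J) + h(\hat{Z}) - \big[h(X, \hat{Z} \mid J) + I(X, \hat{Z}; J)\big] \\
&= h(Z) + h(J) + h(\hat{Z}) - h(Z, \hat{Z}) - I(X, \hat{Z}; J) \\
&= h(Z) + h(J) + h(\hat{Z}) - h(Z, \hat{Z}) - h(J) + h(J \mid X, \hat{Z}) \\
&\overset{(b)}{=} I(Z; \hat{Z}) + h(J) - h(J) = I(Z; \hat{Z}).
\end{aligned}
\]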

Step (a) is due to Theorem Q] and step (b) follows since h(J\X, Z) is zero. 
The Shannon lower bound is simply

\[ R_{\mathrm{SLB}}(D) = \log(1/D), \tag{25} \]

the Gaussian rate-distortion function under the MSE fidelity criterion, reduced by $\log K!$ bits (one bit).
Note that since the order statistic source cannot be written as the sum of two independent processes,
one of which has the properties of a Gaussian with variance D,⁵ the Shannon lower bound is loose
everywhere [58], though it becomes asymptotically tight in the high-rate limit.

The covariance matrix of the Gaussian order statistics can be computed in closed form as

\[ \begin{bmatrix} 1 - 1/\pi & 1/\pi \\ 1/\pi & 1 - 1/\pi \end{bmatrix} \]

with eigenvalues 1 and $1 - 2/\pi$. Reverse waterfilling yields the Shannon upper bound

\[
R_{\mathrm{SUB}}(D) = \begin{cases}
\frac{1}{2}\log\left(\frac{2}{D}\right) + \frac{1}{2}\log\left(\frac{2 - 4/\pi}{D}\right), & 0 < D < 2 - 4/\pi \\
\frac{1}{2}\log\left(\frac{1}{D - 1 + 2/\pi}\right), & 2 - 4/\pi \le D < 2 - 2/\pi \\
0, & D \ge 2 - 2/\pi.
\end{cases}
\tag{26}
\]

This bound is tight at the point achieved by zero rate. Since the Gaussian order statistics for K = 2
have small non-Gaussianity, the Shannon lower bound and the Shannon upper bound are close to each
other, as shown in Figure 4. For moderately small distortion values, we can estimate the rate-distortion
function quite well.

⁵ Even though $X_{(1:2)} = \frac{1}{2}(X_1 + X_2) - \frac{1}{2}|X_1 - X_2|$ and $X_{(2:2)} = \frac{1}{2}(X_1 + X_2) + \frac{1}{2}|X_1 - X_2|$, and the first terms are
Gaussian, the troublesome part is the independence.

Fig. 4. Shannon upper and lower bounds for the Gaussian order statistic rate-distortion function (rate versus average distortion D). The point achievable by single
set code of Figure 3 is also shown connected to the zero rate point, which is known to be tight. Note that rate is not normalized
per source letter.
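
The two bounds for the K = 2 Gaussian example can be evaluated numerically. The sketch below assumes the lower bound is log2(1/D) and the upper bound is obtained by reverse waterfilling over the eigenvalues 1 and 1 - 2/pi with D the total (unnormalized) squared error; rates are in bits, and the constant and function names are hypothetical.

```python
import math

D_MAX = 2 - 2 / math.pi   # zero-rate distortion: sum of the two eigenvalues
D_KNEE = 2 - 4 / math.pi  # below this distortion both components receive rate


def shannon_lower_bound(D):
    """Shannon lower bound in bits; negative values mean the bound is vacuous."""
    return math.log2(1 / D)


def shannon_upper_bound(D):
    """Reverse waterfilling over eigenvalues 1 and 1 - 2/pi, in bits."""
    if D >= D_MAX:
        return 0.0
    if D >= D_KNEE:
        return 0.5 * math.log2(1 / (D - 1 + 2 / math.pi))
    return 0.5 * math.log2(2 / D) + 0.5 * math.log2(D_KNEE / D)


if __name__ == "__main__":
    for D in (0.05, 0.1, 0.25, 0.5, D_KNEE, 1.0, 1.2, D_MAX):
        print(round(D, 4), round(shannon_lower_bound(D), 4), round(shannon_upper_bound(D), 4))
```

Below the knee the gap between the two bounds is the constant $\frac{1}{2}\log_2(4 - 8/\pi)$, roughly 0.27 bits, under these assumptions.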

The fact that the Shannon lower bound is loose everywhere applies not only to the particular example
we considered, but to any problem. That is to say, $\log K!$ bits cannot be saved below the rate distortion
function for the usual squared error fidelity criterion.

Theorem 9. The Shannon lower bound to the rate distortion function is loose everywhere for any source,
under the fidelity criterion $F_4$.

Proof: The support of the joint distribution (15) for any order statistic source is the convex cone
$\mathfrak{R} = \{x_1^K : x_1 < \cdots < x_K\}$. The support of a Gaussian distribution is all of $\mathbb{R}^K$. For the Shannon lower
bound to be tight, the source must be decomposable as the sum of two independent processes, one of
which has the properties of a Gaussian [58]. Since the Gaussian density has support over all space, it
cannot be convolved with another density (non-negative) to yield a third density that has support over
only part of space. ■

VI. Universal Lossy Coding 

The final setting in which we investigate the ramifications of order irrelevance is universal lossy
coding. The general goal in universal source coding is to find encoding algorithms that perform well for
all members of a class of sources [59]. Here we have the modest goal of demonstrating that $O(\log n)$
rate requirements extend quite generally to lossy coding. The results we present are not intended to be
conclusive, but rather are included to wind up our tour of source coding.

Recall the main result of Section V-A: under a per-letter MSE fidelity criterion, zero distortion is
achievable with zero total rate for a large class of sources. This result is obtained with the number of
letters n growing without bound and the source distribution known. An interpretation of this is that
using the known distribution to pseudorandomly "simulate" the source at the destination is sufficient for
achieving zero distortion. Now we consider a universal setting in which this approach will not work
because the source distribution is not known at the destination.

Instead of giving a result for real-valued sources and multiset MSE, we jump directly to a more general
result. Let d be a single-letter distortion measure, and let $R^*(D)$ denote an operational rate distortion
function that is achievable by fixed-rate codes uniformly over a class of sources (coding the sources as
sequences). As in Section V-A.2, for coding without regard to order, consider the single-letter distortion
measure $\rho_n$ and associated fidelity criterion $F_1$ given in (12)-(13). Denote by $R^*_{\{X_i\}_{i=1}^n}(D)$ the minimum
(total) rate for encoding $\{X_i\}_{i=1}^n$ with $E[\rho_n(X_1^n, \hat{X}_1^n)] \le D$ for every source in the class. Then we
obtain the following result analogous to Theorem 8.

Theorem 10. If $R^*(D)$ is finite, then for any $\epsilon > 0$,

\[ R^*_{\{X_i\}_{i=1}^n}(D + \epsilon) = O(\log n). \]

Proof: Let D be such that $R = R^*(D)$ is finite and let $\epsilon > 0$. The achievability of $R^*(D)$ uniformly
over the class means that there is a finite dimension N at which distortion $D + \epsilon$ is achieved at rate R
for every source in the class. Thus only minor adjustments to the proof of Theorem 8 are needed. ■
As a simple application consider a set of real-valued parent distributions that share a bounded support.
An arbitrarily small multiset MSE can be obtained uniformly over all the sources with $O(\log n)$ rate.
This follows from Theorem 10 because the finiteness of $R^*(D)$ for any positive distortion D can be
demonstrated by uniform quantization of the support of the source class.

VII. Concluding Comments: From Sequences to Multisets 

We have completed a tour through the major areas of source coding while discussing how things are 
changed by irrelevance of the order of source letters. To conclude, we discuss three conceptual transitions 
between sequences and multisets and then summarize. 
Fig. 5. Logarithm of the number of binary sequences, multisets, or k-gram multisets as a function of the number of binary
letters.

A. Types → Markov Types → Sequences

In the Shannon-style language approximations that were mentioned in the opening, a first-order
approximation corresponds to a multiset of letters of the original alphabet, whereas an approximation of
the same order as the length of the sequence is the sequence itself. In between these extremes, there
are many possibilities: A second-order approximation is a multiset of digrams (ordered pairs of source
letters), a third-order approximation is a multiset of trigrams, etc. Thus, as the approximation order is
increased, the lengths of segments within which the ordering of letters is relevant increases.

For a fixed alphabet, increasing the approximation order also causes the number of distinct outcomes to
increase. For first-order approximations to n source letters drawn from alphabet $\mathcal{X}$, the number of distinct
outcomes is the number of types. For an $\ell$th-order approximation, the number of distinct outcomes is
the number of Markov type classes [60]-[62]. Markov type classes are also known as kinds in cognitive
science [12], [13] and used to visualize motifs in computational genomics [63]. The enumeration of
Markov types is not simply expressed [62], but can be upper bounded by (n + The number of 

Markov types gives an upper bound on the rate requirements for lossless coding and is computed exactly
in Figure 5. There is no source that can achieve the enumeration upper bound for different values of $\ell$
simultaneously since it is impossible to have equiprobable sequences and multisets at the same time. For
a real source, like the empirical source from the Zenith radio experiments in telepathy [64], the entropy
is much lower than the bounds; see Table II.






TABLE II 

Entropy of Zenith Radio Telepathy Data 

                          Entropy (bits)   Bound (bits) 
sequence                  4.6663           5.0000 
multiset of tetragrams    4.5892           4.9542 
multiset of trigrams      4.3359           4.8074 
multiset of digrams       3.5411           4.1699 
multiset                  1.8758           2.5850 
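
The Bound column can be checked by brute-force enumeration. The sketch below assumes that the data are binary sequences of n = 5 letters (consistent with the 5.0000-bit sequence bound) and that k-grams are taken with a non-cyclic sliding window. Both are assumptions, as is the function name count_kgram_multisets.

```python
from collections import Counter
from itertools import product
from math import log2


def count_kgram_multisets(n, k, alphabet="01"):
    """Number of distinct k-gram multisets over all length-n sequences.

    Assumes a non-cyclic sliding window (n - k + 1 overlapping k-grams).
    Exhaustive enumeration, so only suitable for small n.
    """
    seen = set()
    for seq in product(alphabet, repeat=n):
        grams = Counter(seq[i:i + k] for i in range(n - k + 1))
        # A frozenset of (k-gram, count) pairs is a hashable multiset.
        seen.add(frozenset(grams.items()))
    return len(seen)


n = 5
counts = [count_kgram_multisets(n, k) for k in range(1, n + 1)]
bounds_bits = [log2(c) for c in counts]
```

Under these assumptions the counts for k = 1, ..., 5 are 6, 18, 28, 31, and 32, whose base-2 logarithms agree with the Bound column of Table II to four decimal places.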



B. Partially Commutative Alphabets 

Rather than varying ordering requirements by varying the segments over which order is relevant, one 
can allow particular letters in the alphabet to commute in position with other letters. As discussed in [65], a 
source with such a partially-commutative alphabet can be described by a noncommutation graph. As edges 
are removed from this graph, the importance of the order of letters decreases. In the case of the empty 
noncommutation graph, the order is irrelevant and the so-called lexicographic normal form associated with 
the noncommutation graph is simply the sequence sorted into order. The distinct outcomes associated 
with a noncommutation graph are called interchange classes and the moment-generating function for 
the number of interchange classes is equal to the inverse of the Möbius polynomial corresponding to a 
function of the noncommutation graph. Interchange entropies are discussed in detail by Savari [65]. 
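
The two extremes of the noncommutation graph can be illustrated with a small sketch of the lexicographic normal form. It uses the usual greedy construction: repeatedly emit the smallest letter that can be moved to the front by commutations. The representation of the graph as a set of unordered letter pairs and the function name lex_normal_form are assumptions made for this sketch, not necessarily the construction used in [65].

```python
def lex_normal_form(word, noncommute):
    """Lexicographically smallest word equivalent to `word`.

    `noncommute` is a set of frozenset({a, b}) pairs of distinct letters
    that may NOT be interchanged (the edges of the noncommutation graph).
    Hypothetical helper; quadratic-time greedy sketch.
    """
    remaining = list(word)
    out = []
    while remaining:
        best = None
        for i, a in enumerate(remaining):
            # Occurrence i can reach the front only if it commutes with
            # every letter before it; equal letters keep their order.
            movable = all(
                a != b and frozenset((a, b)) not in noncommute
                for b in remaining[:i]
            )
            if movable and (best is None or a < remaining[best]):
                best = i
        out.append(remaining.pop(best))
    return "".join(out)


empty_graph = set()
complete_graph = {frozenset("ab"), frozenset("an"), frozenset("bn")}
partial_graph = {frozenset("ab")}

sorted_form = lex_normal_form("banana", empty_graph)
unchanged_form = lex_normal_form("banana", complete_graph)
partial_form = lex_normal_form("banana", partial_graph)
```

With the empty graph the normal form is the sorted sequence "aaabnn", with the complete graph the word is left as "banana", and with only the pair {a, b} noncommuting the result is "baaann".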

If noncommutation graphs are defined on sliding windows of source letters rather than on individual 
letters, the problem becomes one of source coding with a 0-1 context-dependent fidelity criterion [20], 
[66]; sliding windows that have distortion zero between them commute. Since the sliding windows 
overlap, however, the commutation relations must be constrained to remain consistent. Just as sources 
with partially-commutative alphabets lead to type classes when the noncommutation graph is empty, 
sources with empty noncommutation graphs on sliding windows lead to Markov type classes. 

C. Quantum Physics 

In statistical physics, the Maxwell-Boltzmann statistics are used for non-interacting, identical bosons 
in the classical limit and correspond to sequences, whereas the Bose-Einstein statistics are used when 
quantum effects are manifested and correspond to multisets. In the classical regime, bosons of the same 
energy level, x ∈ X, may be distinguished by their different positions in space. That is to say, the 
order of particles is important. As the concentration of particles increases, some particles become so 
close that they can no longer be distinguished in position, and degeneracy results. Thus degeneracy 
measures the importance of order in representing bosons. When the concentration of particles exceeds 
the quantum concentration, i.e. when the interparticle distance is less than the thermal de Broglie 
wavelength, the bosons become indistinguishable and so representable by a multiset. The combinatorics 
of indistinguishability as a function of particle concentration is given in the so-called partition function 
for bosons. The partition function allows distributional characterizations of intermediate levels of particle 
concentration to be made. 
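
The correspondence between the two statistics and the two source representations can be made concrete by counting configurations of n particles over |X| energy levels: |X|^n distinguishable arrangements (sequences) versus C(n + |X| - 1, n) indistinguishable ones (multisets). The following is a minimal counting sketch only; the function names are hypothetical and it does not model the partition function or intermediate concentrations.

```python
from math import comb, log2


def maxwell_boltzmann_count(num_levels, n):
    """Distinguishable particles: one configuration per sequence."""
    return num_levels ** n


def bose_einstein_count(num_levels, n):
    """Indistinguishable particles: one configuration per multiset."""
    return comb(n + num_levels - 1, n)


# Growth of log2(count) for a two-level system as n increases.
for n in (5, 50, 500):
    mb_bits = log2(maxwell_boltzmann_count(2, n))
    be_bits = log2(bose_einstein_count(2, n))
    print(f"n={n}: sequences {mb_bits:.1f} bits, multisets {be_bits:.1f} bits")
```

For two levels and n = 5 the counts are 32 and 6, the same as the sequence and multiset counts for five binary letters.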

Incidentally, at even greater concentrations than the quantum concentration, the probability mass of 
letters appearing in the multiset concentrates on a single letter, as determined by the average boson 
occupation number. This is known as Bose-Einstein condensation. 

D. Summary 

Partial or full order irrelevance has significant qualitative impact. For lossless coding of n letters from a 
finite-alphabet source, the rate requirement grows only logarithmically with n rather than linearly with n 
(Theorem 2); the rate reduction as a ratio is thus arbitrarily large. In a universal setting, the rate reduction 
is again arbitrarily large: for a source satisfying Kieffer's condition for sequence representation [41], 
universal coding with n + o(n) bits is achievable (Theorem 0]). This should be compared with cn bits 
when order is relevant, where the constant c could be arbitrarily large. Despite this positive statement about 
universal coding, it is impossible for the redundancy to be a negligible fraction of the coding rate 
(Section IV-B4). 

For lossy coding subject to per-letter MSE distortion, irrelevance of order can trivialize the source 
coding problem for a large class of sources. Specifically, under rather weak moment conditions on the 
parent distribution, zero distortion is achieved even with zero rate as n → ∞ (Theorem 7). This is not 
of practical importance because a source coder will process only a finite amount of data at once. High-resolution 
analyses of various quantization schemes for a block of size K are presented in Section IV-C2. 
Through the inclusion of shape and memory advantage, and under the assumption of high rate, a rate 
savings of log K! bits can be achieved relative to naive scalar quantization. However, in a rate-distortion 
setting the "full" savings of log K! bits can only be achieved as the rate approaches infinity, not at any 
finite rate (Theorem 9). 